"Validating the Map" by Sanjay Bhangar Live captioning by Norma Miller. @whitecoatcapxg All right, all right, so please join me in welcoming Sanjay from MapBox. He's going to talk about validating the map. [applause] >> Hey, everyone. Am I audible? Can I be heard? Yup, yeah, it's so good to be here, it's my third State of the Map. I'm not sure why trajectories on the world keep bringing me here but it's the first time on this side of the fence so I'm excited to be talking about some of the work we've been doing over the last several months at looking at approaches to better validate the data on the map, right? You know, I know most of us know this, but I think it's always nice to pause and just take in these numbers. When I first saw OpenStreetMap, you know, as a project, about eight years ago, I think, this was unimaginable. It was, you know what? We're going to try and create, like have people map everything in this world? Really? Can that work? We've gotten pretty far. Those are numbers that I almost don't know how to say, like 3 billion something nodes? And you know, we've been focused on mapping, right? Like it's like how do we complete this map? It's an unimaginably hard task, and somehow people have come together and we've come this far, so you know, we're potentially moving to another -- I wouldn't say there's still so much left to be mapped, but we are moving to a phase in the project where people are seriously depending on this. There's end users who are using this for directions, for base maps, millions of people are looking at this map every day and how can we improve processes to you know, detect breakage, validate the data on the map, and so that's what my talk is going to be about, speak about some things that we've been working on, and you know, and ask the community how we can work together better on this problem, because of course, there's a lot of edits every day, the world is a big place, and you know, none of this can do this alone. 
So I'm really hoping to have, you know, a lot of questions and discussions at the end, and there's also something that we'll be working on at the end of the day. I really want this to be as open-ended as possible. Most of us know this: you can go to OpenStreetMap, you can sign up, you can use your editor of choice. It's as easy as this to draw things on the map: draw roads, draw buildings, mark your favorite restaurant, whatever you want to do. And you know, the first thing people ask when you tell them that this is possible is, like, but isn't it always breaking? Aren't people always breaking things? Aren't -- I don't know, people drawing crazy things on the map? How does this work? I mean, it's the same question people had with Wikipedia when Wikipedia started. It's the same question people have with the map. So how does this not break? I mean, mostly this community is amazing, right? We all know this. I'll go through some of our process, but the short answer is that the community is amazing at detecting and fixing problems really fast. And as we get more editors on board, as we expand, you know, how can we encourage this more in the community? There are structures: the mailing list works really well to report when you see a user kind of vandalizing the map or doing something that's intentional. There are pretty good procedures for warning a user, starting off friendly. You know, the Data Working Group, they do a great job: first they kind of temporarily block a user and say, look, just read what we have to say, then you can sign in and start editing again. If the user continues doing something, they get blocked. This works pretty well. And most bad edits that happen, and they do happen -- editing geometry is hard. It takes practice; people who've been editing for a lot of years make mistakes. So a lot of bad edits are accidental, right, there isn't as much intentional vandalism. This could change.
When we started looking at this, I didn't know what kind of can of worms we'd be opening up when we actually started looking at stuff more in detail. I think the overall good news is there isn't much vandalism, but how can we prepare for a time when there might be, when people think, oh, it might be fun to go and do something? Whatever. Why is this important to us? Of course I'm doing this as part of my day job, which is great, I get paid to do stuff that I wanted to do anyways, it's amazing. [laughter] So we at MapBox, what do we do? We serve MapBox Streets, which is, you know, OpenStreetMap, which can be styled, which a lot of our customers use. We actually update this in real time from OpenStreetMap, so the time from when an edit is made in OpenStreetMap to when it goes live to our customers who are using our Streets product is about ten minutes, right? So it's really important to us to catch these things as soon as possible, and to fix them, and to, you know, make sure our customers don't see broken maps. We provide directions using OpenStreetMap data; we've got an open source project called OSRM. Again, we don't want people to drive into a lake or something, so it's pretty important for us that the map doesn't break. And of course, breakages result in broken maps for millions of our end users, and this is what the maps look like, and you know, we want to keep them looking pretty. Some of our customers, we've got some pretty big customers: Pinterest, FourSquare, The Weather Channel uses us. You know, for directions we provide a routing service; there's a few different customers of our routing service. It's something we're working really hard on, and you know, it would be really sad for us if our routing broke every few hours. So what are we looking at? How are we tackling this? So I work in the Bangalore office in MapBox.
We've been spending a few hours every day on, let's start looking at this problem. It's great to be able to have eyes on the map to start sussing out what this problem looks like better. So we do daily manual reviews based on some of the other tools, which I'll get into. It's not a random sampling of edits, but you know, I'll get into why it's been kind of hard to actually crystallize definite bad edits, but we're trying to get a map of what it looks like: what happens around bad edits, how can we write better software to fix these? But right now, let's look at a lot of things and identify what the problems are, and of course, as we find problems, fix them. I've been iteratively building upon, you know, a great ecosystem of existing tools, looking at building our own tools, and playing around with different approaches, right, and importantly identifying the kinds of problems and patterns in bad edits so that we can start building up, you know, like a corpus of bad edits for other people to use, to sort of base machine-learning approaches and stuff like that on. What are the types of errors we encounter? I could spend the entire sort of session talking about details of types of errors, but broadly there are a lot of newbie errors. You know, people are going to accidentally move a node, they don't know how the tagging structure works. Does anybody really know how the tagging structure works? [laughter] But you know, there will be mistakes; at these times we try and be as nice as possible. You know, send them a message: thank you so much for editing the map, this is how it can be improved, if we can help you, please let us know. A very common thing is license violations. We just look at changeset comments, people who say, oh, yeah, we saw this in Google Maps and we just put it here, and it's like, you know, that's really sweet of you, but that violates the license. Please don't do that. We notice that happens quite a lot.
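The changeset-comment check described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual tooling: the function name `flag_suspect_comments`, the dictionary layout with `id` and `comment` keys, and the single watched term are all hypothetical, and a person would still need to read each flagged comment before contacting the mapper.

```python
import re

# Hypothetical watch-list: only "Google Maps" is mentioned in the talk,
# so the pattern is limited to the word "google" (case-insensitive).
SUSPECT_PATTERN = re.compile(r"\bgoogle\b", re.IGNORECASE)


def flag_suspect_comments(changesets):
    """Return the ids of changesets whose comment mentions a watched source.

    `changesets` is assumed to be an iterable of dicts with an "id" key and
    an optional "comment" key (an illustrative layout, not a real API shape).
    """
    flagged = []
    for changeset in changesets:
        comment = changeset.get("comment") or ""  # tolerate missing or None comments
        if SUSPECT_PATTERN.search(comment):
            flagged.append(changeset["id"])
    return flagged


if __name__ == "__main__":
    sample = [
        {"id": 1, "comment": "Saw this in Google Maps and put it here"},
        {"id": 2, "comment": "Added sidewalks from a ground survey"},
        {"id": 3, "comment": None},
    ]
    print(flag_suspect_comments(sample))
```

A match only marks a changeset for manual review; a comment can mention the word for perfectly legitimate reasons.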
There's a message in iD that notifies users: if you're copying this from Google, please don't do that. There's a lot of accidental breakage. You spend a lot of time moving geometries, and just moving one node can break things significantly, which, you know, happens all the time, and we try and detect that. And of course there's a very, very, very small subset of intentional damage, right? Intentional damage includes stuff like people naming roads after themselves, like, why not? You know, all roads can be called Sanjay Bhangar Road, that would be so nice. There's people putting advertising links to their website, stuff like this. There's people just messing around. Some people really like the fact that it's a drawing tool, you can do art, so there was, like, a big, like, really -- it's so sad to remove it because someone spent a lot of time on that, but you know, a stork in the middle of Russia that was beautifully handcrafted. But there's, like, open fictional map or so, which is a great place for people who want to do that; it doesn't belong in OpenStreetMap, right? So, just some kinds of things: you can see, like, a single node movement can really visually damage the map. This was something that happened a few months ago: a single node resulting in, you know, major visual damage in the middle of Manhattan. This was fun -- hmm, why not, right? This stuff is hard to detect. I'll come to speak about that. And you know, newbie errors happen all the time. But let's focus on being nice to them. We want to welcome more people to the community. We all made those mistakes when we started off.
Right, so these are not comprehensive numbers, but just to give you an idea, we try and review, you know, about 80 to 100 changesets every day in the team. I'll show you the tool that we're kind of using. So we do like to review changesets that have suspicious-looking comments or do, like, mass additions, but the percentage that we find that have problems is less than 2.5%, and a lot of the problems are really small. But these are what some of those numbers look like by, you know, different suspicion categories that we put them in, and we try and review as much as we can. How we respond: as I said, you know, be as friendly as we can in changeset comments. Almost all the time it's a genuine error, where we want to be as nice to the person as possible. JOSM has a great revert tool, so you can put in the changeset ID. Please read all the caveats associated with that. If those objects have been subsequently changed, you need to be sure that everything is OK after your reversion, but it's a pretty easy tool to just revert a changeset, so you can look up the JOSM revert plugin, it's great. If we do notice something intentional, most of the time somebody has already noticed that before us; the community is just amazing. And then we will escalate it to, like, the mailing list, to the DWG, which is usually really great at responding. And as I said, most often problems are already fixed, but we still do try and collect this data as much as possible, to try to build up this corpus. So even if a problem is already fixed, we make a note of it, plug it into a spreadsheet where we've got a repository to build up this corpus of bad edits, to see what we can do with it in the future. Some of the existing tools that we depend on affect data, as well. There's HDYC, How Did You Contribute, by Pascal N, that lets you look up amazingly detailed user histories in OpenStreetMap. There's Who did it, which lets you kind of zoom into a bounding box.
See changesets by a time period and get details for those changesets. It lets you export an RSS feed of this that can be really useful for keeping track of changes in your neighborhood, for example. Osmose, I hope I pronounced that right, and Keep Right, which try and flag common errors on the map, right? Just quick screenshots of these. How did you contribute; Who did it, where you can select your area and see the number of changesets. Osmose, and Keep Right, which just flag common errors and let people go in and fix them. These tools are great. We wanted things that we could iterate on a little bit more and use different approaches, so we use a mixture of those tools and our tools, and I'll come to how we can work with you people to try to consolidate these efforts. So our tools are based on three things. Changeset metadata, which is something called osmcha-django. Daily analysis of QA tiles: we have this process called OSM Lint. And then real-time change monitoring is something we're looking at, that we run some analysis on. This is what osmcha-django looks like. So you can search by editor. With maps.me, we spent some time searching for maps.me changes every day, for example. We built something called changeset map, which lets you visualize changes on a map. It was built on a great tool, but we needed changes. And there's OSM Lint, which, as I said, lets us process all the tiles for the world, so we do this once a day and find common errors, stuff like roads that are over water, roads that are intersecting with buildings; kind of just linting the map, right? Things that are happening all the time and things that we want to keep a practice of fixing, right? So it depends on a library called tile reduce, which is a great way of splitting this up across multiple machines to parallelize this work and do it really fast, and we plug it into a micro-tasking tool called to-fix.
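As a rough illustration of what a lint check such as "roads intersecting with buildings" involves, the sketch below tests whether a road polyline crosses a building outline. It is not the OSM Lint implementation. It assumes planar (x, y) coordinates held in plain Python lists rather than QA tiles processed with tile reduce, and all function names are hypothetical.

```python
def orientation(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (cross > 0) - (cross < 0)


def segments_cross(p1, p2, q1, q2):
    """True when segment p1-p2 strictly crosses segment q1-q2.

    Touching at an endpoint or overlapping collinearly does not count,
    which keeps a road that merely ends at a wall from being flagged.
    """
    return (
        orientation(p1, p2, q1) * orientation(p1, p2, q2) < 0
        and orientation(q1, q2, p1) * orientation(q1, q2, p2) < 0
    )


def road_crosses_building(road, building_ring):
    """Check every road segment against every edge of a closed building ring.

    `road` is a list of (x, y) points; `building_ring` is a list of (x, y)
    points whose last point repeats the first.
    """
    road_segments = list(zip(road, road[1:]))
    ring_edges = list(zip(building_ring, building_ring[1:]))
    return any(
        segments_cross(a, b, c, d)
        for a, b in road_segments
        for c, d in ring_edges
    )


if __name__ == "__main__":
    building = [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)]
    print(road_crosses_building([(-1, 1), (3, 1)], building))  # road cuts through
    print(road_crosses_building([(-1, 3), (3, 3)], building))  # road passes by
```

This brute-force comparison is quadratic in the number of segments and would miss a road drawn entirely inside a building, so a real check would need a spatial index and a containment test; results of this kind are the sort of candidate that could then be queued for manual fixing.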
These are some statistics from when we started working on it: the different kinds of errors that we've looked at, how many issues we fixed, and how many things we've reported as false positives. You know, stuff like unconnected major highways, just highways that were not connected to the main road network because of, you know, TIGER imports and different things that have happened. And you know, the team just goes in, and it's really quick for us to plop it into to-fix and start fixing stuff, so that's something we do daily. Right now there's about 50 to 100 issues on the highways that come up daily. We spend a couple of hours a day linting things like this. Real-time change detection is something we're going to be hacking on a little bit tomorrow. How do we detect the things coming out in real time? This is my ending mark. How do we open source this? It's something we're looking into, but right now this is the internal tool that we use. We use some of the data that we collect at MapBox to improve the map. We get telemetry data that we collect to detect incorrect and missing turn restrictions. On the directions API we find errors, and we find that maybe that road network doesn't connect with the road network, and that's something we can fix, as well. The challenge that we face right now is that it's a bit fragmented. It's allowed us to quickly iterate on each of these approaches, but it's frustrating to have it so fragmented. Each of the approaches has its subtle limitations, which it will be great to bring together. There's no substantial corpus about it. That's the first thing we hunted for: oh, there must be someone who's had a corpus that we can build upon. If anyone is working on that, it will be great to talk. The world is very large, you know that; we're getting smaller, but the geography of it is getting large. It's not like our data teams are going to be able to keep an eye on the whole world. So how do we do this better, how do we collaborate?
How do we consolidate our efforts? It's important to a lot of people in this room that the map stays reliable and is protected from things like vandalism. So how do we work together, how can we build better tools for the community, how can we support that? I'm super interested in figuring out what the community wants, and the fancy learning things that I know very little about. I think there's a lot to be learned from the Wikipedia experience. How can we learn a lot from projects like Wikipedia, who've obviously done this extremely well? These are some of our code repositories. I've got the stop sign, so I'm going to go really fast. How can we consolidate efforts? We're having a BoF at room 107 and a hack day tomorrow, and thank you. [applause]
We've got about five minutes for questions, discussions, I really wanted more but -- [speaking off mic]
>> So you know, I'm not so directly involved with the directions team, so I may say something wrong. The code is open source. So we use -- we do factor in -- so obviously highway is the most important thing. We do do bicycle routing and road routing. We don't do public transportation yet. We do do walks, so sidewalks are really important to us. We have spent some time, and there's a lot of discussion about the proper way to tag sidewalks, so we've been trying to bring some consolidation to that, but yeah, I --
AUDIENCE MEMBER: And a follow-up, how many people are on your team?
>> We've got 27 people now.
AUDIENCE MEMBER: In some parts of the world, [inaudible] get reviewed in the community by a couple of -- [inaudible] [speaking off mic]
>> Yeah, this is exactly what I want to find out, right? So I think the thing that has worked really well, and that people use, is essentially this: people monitoring their neighborhood. We know at OSM this is what works, right? People are really passionate about their own neighborhood; people are really passionate about a subset of tags that they care about.
Bikers really care about bike lanes, for example. So, I mean, this is essentially, like, right now you can do RSS feeds for a bounding box, but how can we make that more detailed, and maybe -- like, I love RSS, but maybe everyone doesn't, and how can we make that more accessible to people? So I throw that question a little bit back to the audience, if people have ideas for this. We'll be hacking a little bit on infrastructure and stuff tomorrow that we think can support work like this, and of course I'd be interested in getting to know how to get people more excited about monitoring their neighborhoods, right? Because that's been continuously more important. A lot of people are really good at monitoring the maps, but we just need to continue gardening. And does that answer the question?
AUDIENCE MEMBER: When you mentioned the fear that you wouldn't want somebody to drive into a pond, right, what it made me think of, of course, was self-driving cars. Is there any thought of using OpenStreetMap data for self-driving cars? And it seems like the proprietary mapping services would have the same kind of quality control, you know, if maybe not as much, because they don't have a volunteer core contributing. How much of the error is the direct result of openness? And how much error is there in OpenStreetMap compared to, say, Google Maps? [inaudible]
>> Yeah, I think there's -- I mean, I think things get fixed really fast in OpenStreetMap. I definitely see that as a strength and not a weakness in terms of it being an open volunteer effort, and like, from -- I mean, it's slightly anecdotal, but what we've noticed is that definitely the lag time and things, the response time, is really quick on OpenStreetMap. People notice things broken and fix them.
I mean, to answer the self-driving cars question, I mean, that was the April Fools' joke, right, that OpenStreetMap with self-driving cars is available. I think we're probably a little bit -- but that's definitely something that, you know, we're thinking about and lots of people are thinking about. One would definitely have to think about, you know, how do we have some premoderation or something when it comes to that state. We're not there yet, and luckily I'm not directly involved in thinking about that or taking responsibility for that, so it's not -- but yeah, I definitely see the volunteer effort and community involvement as more of a strength than a weakness in this regard. Like, I definitely think we need some safeguards, but hopefully these tools will be a lot more mature. I think the community is seeing that kind of validation detection, and having the data open also makes it a lot easier for so many more people to work on detection and stuff like machine learning algorithms. These things have gotten so much better over the past year or two that I think we should be prepared when the time comes, yeah.
>> OK, sorry. We're going to have to cut off for now, because I want to make sure everyone gets to lunch. But I encourage you to connect with Sanjay during lunch and ask him some questions. So thank you very much. [break]